Abstract

We present a novel deep Recurrent Neural Network (RNN) model
for acoustic modelling in Automatic Speech Recognition (ASR).
We term our contribution a TC-DNN-BLSTM-DNN model: it combines
a Deep Neural Network (DNN) with Time Convolution (TC), followed
by a Bidirectional Long Short-Term Memory (BLSTM) network and a
final DNN. The first DNN acts as a feature processor for our
model, the BLSTM then generates a context from the sequential
acoustic signal, and the final DNN takes the context and models
the posterior probabilities of the acoustic states. We achieve a
3.47 Word Error Rate (WER) on the Wall Street Journal (WSJ)
eval92 task, a more than 8% relative improvement over the
baseline DNN models.

Index Terms: Deep Neural Networks, Recurrent Neural Networks,
Long Short-Term Memory, Asynchronous Stochastic Gradient
Descent, Automatic Speech Recognition

1. Introduction 

Deep Neural Networks (DNNs) and Convolutional Neural Networks
(CNNs) have yielded many state-of-the-art results in acoustic
modelling for Automatic Speech Recognition (ASR) tasks [?].
DNNs and CNNs often accept some spectral feature (e.g., log-Mel
filter banks) with a context window (e.g., +/- 10 frames) as
input and are trained via supervised backpropagation with
softmax targets to learn the Hidden Markov Model (HMM) acoustic
states.

DNNs do not make many prior assumptions about the input feature
space, and consequently the model architecture is blind to
temporal and frequency structural localities. CNNs are able to
directly model local structure through the use of convolutional
filters. CNN filters connect to only a subset region of the
feature space and are tied and shared across the entire input
feature, giving the model translational invariance [?].
Additionally, pooling is often added, which yields rotational
invariance [2]. The inherent structure of CNNs yields a model
much more robust to small shifts and permutations.

Speech is fundamentally a sequence of time signals. CNNs (with
time convolution) can capture some of this time locality through
the convolution filters; however, CNNs may not be able to
directly capture longer temporal signal patterns. For example,
temporal patterns may span 10 or more frames, while the
convolution filter may be only 5 frames wide. The CNN model must
then rely on the higher level fully connected layers to model
these long term dependencies. Additionally, one size may not fit
all: phones and temporal patterns span varying numbers of
frames. Optimizing the convolution filter size is an expensive
procedure and is corpus dependent [?].

Recently, Recurrent Neural Networks (RNNs) have been introduced,
demonstrating powerful modelling capabilities for sequences [?].
RNNs incorporate feedback cycles in the network architecture.
RNNs include a temporal memory component (for example, in Long
Short-Term Memory (LSTM) networks, the cell state [9]), which
allows the model to store temporal contextual information
directly in the model. This relieves us from explicitly defining
the size of temporal contexts (e.g., the time convolution filter
size in CNNs), and allows the model to learn this directly. In
fact, in [8] the whole speech sequence can be accumulated in the
temporal context.

There exist many implementations of RNNs [10]. LSTMs and Gated
Recurrent Units (GRUs) [?] are particular implementations of
RNNs that are easy to train and do not suffer from the vanishing
or exploding gradient problems when performing Backpropagation
Through Time (BPTT) [11]. LSTMs have the capability to remember
sequences with long range temporal dependencies [9] and have
been applied successfully to many applications, including image
captioning [?], end-to-end speech recognition [?] and machine
translation [?].

LSTMs process sequential signals in one direction. One natural
extension is the bidirectional LSTM (BLSTM), which is composed
of two LSTMs. The forward LSTM processes the sequence as usual
(i.e., reads the input sequence in the forward direction), while
the second processes the input sequence in backward order. The
outputs of the two LSTMs can then be concatenated. BLSTMs have
two distinct advantages over LSTMs. First, the forward and
backward passes over the sequence yield differing temporal
dependencies, so the model can capture both sets of signal
dependencies. Second, higher level sequence layers (e.g.,
stacked BLSTMs) using the BLSTM outputs can access information
from both input directions.

LSTMs and GRUs (and their bidirectional variants) have recently
been successfully applied to acoustic modelling and ASR [?]. In
[5], TIMIT phone sequences were trained end-to-end from
unsegmented sequence data using an LSTM transducer. LSTMs can be
combined with Connectionist Temporal Classification (CTC) and
implicitly perform sequence training over the speech signal on
TIMIT [?]. [?] used GRUs and generated an explicit alignment
model between the TIMIT speech sequence data and the phone
sequence. In [?], a commercial speech system is trained using an
LSTM acoustic model; here the entire speech sequence is used as
the context for classifying context dependent phones. [?]
extended [8] and applied sequence training on top of LSTMs. Our
contribution in this paper is a novel deep RNN acoustic model
which is easy to train and achieves an 8% relative improvement
over DNNs on the Wall Street Journal (WSJ) corpus.

2. Model 

Figure 1: TC-DNN-BLSTM-DNN Architecture. The model contains 3
parts: a signal processing DNN which takes in the original fMLLR
acoustic features and projects them to a high dimensional space,
a BLSTM which models the sequential signal and produces a
context, and a final DNN which takes the context generated by
the BLSTM and estimates the likelihoods across acoustic states.

Our model architecture can be summarized as a TC-DNN-BLSTM-DNN
acoustic model. Our model deals with fixed length sequences of a
context window (as opposed to variable length whole sequences
[8]). The advantage of our model is that we can easily use
BLSTMs online (i.e., we do not need to wait to see the end of
the sequence to generate the backward direction pass of the
LSTM). The disadvantage, however, is that the amount of temporal
information stored in the model is limited to the context width
(similar to DNNs and CNNs). On the other hand, in offline
decoding we can compute all the acoustic states in parallel
(e.g., as one big minibatch), versus the O(T) iterations needed
by [?] due to the iterative dependency of the LSTM memory.
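As a minimal sketch of why fixed length windows can be scored in
parallel, the snippet below cuts an utterance into one context
window per frame. The function name, the window sizes and the
choice to pad by repeating edge frames are assumptions made for
illustration; the text does not specify how utterance boundaries
are handled.

```python
from typing import List

Frame = List[float]


def context_windows(frames: List[Frame], left: int, right: int) -> List[List[Frame]]:
    """Return one fixed-length window per frame.

    Frames beyond the utterance boundary are filled by repeating the
    first or last frame (an assumption, not taken from the paper).
    """
    n = len(frames)
    windows = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            k_clamped = min(max(k, 0), n - 1)  # clamp index into the utterance
            window.append(frames[k_clamped])
        windows.append(window)
    return windows


# Toy utterance: 6 frames with a single feature each.
utterance = [[float(t)] for t in range(6)]
windows = context_windows(utterance, left=2, right=2)
```

Each entry of `windows` depends only on the input frames, not on
a recurrent state carried over from the previous window, so all
of them can be stacked into a single minibatch.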

The model begins with a fixed window context of acoustic
features (e.g., fMLLR), similar to a standard DNN or CNN
acoustic model [?]. Within the context window, an overlapping
time window of features, or Time Convolution (TC) of features,
is fed in at each timestep. A similar approach was used by [?];
however, they used a stride of 2 to reduce computational cost,
whereas our motivation is time convolution rather than
computational performance and we use a stride of 1. The model
processes these features with independent columns of DNNs over
the context window timesteps. We refer to this as the TC-DNN
component of the model. The objective of the TC-DNN component is
to project the original acoustic features into a high
dimensional feature space which can then be easily modelled or
consumed by the LSTM. [?] refers to this as a Deep
Input-to-Hidden Function.
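The overlapping time windows can be illustrated with a short
sketch. The function name, the width of 3 frames and the absence
of padding inside the context window are assumptions chosen for
the example; only the stride values (1 here, 2 in the cited
work) come from the text.

```python
from typing import List


def time_convolution(window: List[List[float]], width: int, stride: int = 1) -> List[List[float]]:
    """Concatenate `width` consecutive frames at each step of the context window."""
    steps = []
    for start in range(0, len(window) - width + 1, stride):
        spliced: List[float] = []
        for frame in window[start:start + width]:
            spliced.extend(frame)
        steps.append(spliced)
    return steps


# Toy context window: 7 frames with 2 features each.
window = [[float(t), float(t) + 0.5] for t in range(7)]
tc_stride1 = time_convolution(window, width=3, stride=1)  # overlapping by 2 frames
tc_stride2 = time_convolution(window, width=3, stride=2)  # fewer steps, less overlap
```

With a stride of 1 the recurrent layer sees one spliced vector
per position (5 steps here); a stride of 2 reduces this to 3
steps, which is cheaper but skips every other position.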

The transformed high dimensional acoustic signal is then fed
into a BLSTM. The BLSTM models the time sequential component of
the signal. Our LSTM implementation is similar to [?] and is
described in the equations below:

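\begin{align}
i_t &= \phi(W_{xi} x_t + W_{hi} h_{t-1}) \tag{1} \\
f_t &= \phi(W_{xf} x_t + W_{hf} h_{t-1}) \tag{2} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1}) \tag{3} \\
o_t &= \phi(W_{xo} x_t + W_{ho} h_{t-1}) \tag{4} \\
h_t &= o_t \odot \tanh(c_t) \tag{5}
\end{align}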


We use neither bias terms nor peephole connections; in initial
experimentation we observed negligible difference, hence we
omitted them in this work. Additionally, we did not apply any
gradient clipping or gradient projection; we did, however, apply
a cell activation clipping of 3 to prevent saturation in the
sigmoid non-linearities. We found the cell activation clipping
to help remove convergence problems and exploding gradients. We
also do not use a recurrent projection layer [8]. We found our
LSTM implementation to train very easily without exploding
gradients, even with high learning rates.
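Equations (1) to (5), together with the cell clipping, can be
sketched as a single forward step in plain Python. This is an
illustration under stated assumptions, not the implementation
used for the experiments: the gate non-linearity is taken to be
the logistic sigmoid, the clipping of 3 is read as clamping each
cell value to [-3, 3], and it is applied right after equation
(3). The helper names and the weight dictionary layout are
hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              W: Dict[str, Matrix], clip: float = 3.0) -> Tuple[Vector, Vector]:
    """One LSTM step without bias terms or peephole connections."""
    def pre(gate: str) -> Vector:
        # W_{x,gate} x_t + W_{h,gate} h_{t-1}
        a = matvec(W["x" + gate], x)
        b = matvec(W["h" + gate], h_prev)
        return [p + q for p, q in zip(a, b)]

    i = [sigmoid(z) for z in pre("i")]                                  # eq. (1)
    f = [sigmoid(z) for z in pre("f")]                                  # eq. (2)
    g = [math.tanh(z) for z in pre("c")]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]  # eq. (3)
    c = [max(-clip, min(clip, ck)) for ck in c]                         # assumed clip to [-3, 3]
    o = [sigmoid(z) for z in pre("o")]                                  # eq. (4)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                    # eq. (5)
    return h, c


# Toy 1-dimensional example with large input weights so the cell would exceed 3.
big = {name: [[100.0]] for name in ("xi", "xf", "xc", "xo")}
big.update({name: [[0.0]] for name in ("hi", "hf", "hc", "ho")})
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[2.9], W=big)
```

In the toy call all gates saturate near 1, so the unclipped cell
value would be 3.9; the clamp holds it at 3.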

The BLSTM scans our input acoustic window of time width T
emitted by the first DNN and outputs two fixed-size vectors (one
for each direction), which are then concatenated into a single
vector c. We refer to c as the context of the acoustic signal
generated by the BLSTM. The context c compresses the relevant
acoustic information needed to classify the phones from the
feature context (e.g., the window of fMLLR features).
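The construction of the context can be sketched as two scans
over the same window followed by a concatenation. The sketch
assumes that each direction contributes its final hidden state,
which the text suggests but does not state explicitly. To keep
the example short, a toy recurrence (a running average) stands in
for the LSTM step; all names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (x_t, h_prev) -> h_t


def run(seq: List[Vector], step: Step, h0: Vector) -> Vector:
    """Scan the sequence and return the final hidden state."""
    h = h0
    for x in seq:
        h = step(x, h)
    return h


def blstm_context(seq: List[Vector], fwd_step: Step, bwd_step: Step, h0: Vector) -> Vector:
    h_fwd = run(seq, fwd_step, h0)                  # reads frames first to last
    h_bwd = run(list(reversed(seq)), bwd_step, h0)  # reads frames last to first
    return h_fwd + h_bwd                            # list concatenation


def toy_step(x: Vector, h_prev: Vector) -> Vector:
    """Stand-in for an LSTM step: average of previous state and input."""
    return [0.5 * h + 0.5 * xi for h, xi in zip(h_prev, x)]


window = [[1.0], [2.0], [4.0]]
context = blstm_context(window, toy_step, toy_step, h0=[0.0])
```

The two halves of `context` differ because each direction
weights the most recently read frames most heavily, which is the
sense in which the two passes capture differing temporal
dependencies.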

The context is further manipulated and projected by a second
DNN. The second DNN adds additional non-linear transformations
before the result is finally fed to the softmax layer to model
the context dependent state posteriors. [?] refers to this as
the Deep Hidden-to-Output Function. The model is trained in a
supervised fashion with backpropagation, minimizing the cross
entropy loss. Figure 1 gives a visualization of our entire
model.

3. Optimization 

We found our LSTM models to be very easy to train and converge.
We initialize our LSTM layers with a uniform distribution
U(-0.01, 0.01), and our DNN layers with a Gaussian distribution
N(0, 0.001). We clip our LSTM cell activations to 3; we did not
need to apply any gradient clipping or gradient projection.

We train our model with Stochastic Gradient Descent (SGD) using
a minibatch size of 128; we found larger minibatches (e.g., 256)
to give slightly worse WERs. We used a simple geometric decay
schedule: we start with a learning rate of 0.1 and multiply it
by a factor of 0.5 every epoch. We have a learning rate floor of
0.00001 (i.e., the learning rate does not decay beyond this
value). We experimented with both classical and Nesterov
momentum; however, we found momentum to harm the final WER
convergence slightly, hence we use no momentum. We apply the
same optimization hyperparameters for all our experiments; it is
possible that a slightly different decay schedule would yield
better results. Our best model took 17 epochs to converge, or
around 51 hours in wall clock time with an NVIDIA Tesla K20 GPU.
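The decay schedule described above can be written as a one-line
function. The sketch assumes epochs are counted from 0 and that
the first epoch uses the initial rate of 0.1; the function and
variable names are illustrative.

```python
def learning_rate(epoch: int, initial: float = 0.1, decay: float = 0.5,
                  floor: float = 1e-5) -> float:
    """Geometric decay per epoch, never dropping below the floor."""
    return max(initial * decay ** epoch, floor)


# Learning rates for the 17 epochs of the best model.
schedule = [learning_rate(e) for e in range(17)]

# First (0-indexed) epoch at which the floor is active.
first_floor_epoch = next(e for e, lr in enumerate(schedule) if lr == 1e-5)
```

Under these assumptions the floor takes effect from the 15th
epoch onwards (index 14), so the last few epochs of a 17 epoch
run all use a learning rate of 0.00001.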

4. Experiments and Results 

We experiment with the WSJ dataset. We use si284, with
approximately 81 hours of speech, as the training set, dev93 as
our development set and eval92 as our test set. We observe the
WER on our development set after every epoch and stop training
once the development set WER no longer improves. We report the
converged dev93 WER and the corresponding eval92 WER. We use the
same fMLLR features generated from the Kaldi s5 recipe [20], and
our decoding setup is exactly the same as the s5 recipe (e.g.,
large dictionary and pruned trigram language model). We use the
tri4b GMM alignments as our training targets, and there are a
total of 3431 acoustic states. The GMM tri4b baseline achieved a
dev and test WER of 9.39 and 5.39 respectively.
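The stopping rule can be sketched as follows. The sketch assumes
training stops at the first epoch whose dev WER fails to improve
(no patience), which is one reading of the description above.
The per-epoch WER values are made up for illustration and are
not results from the paper.

```python
from typing import Callable, Tuple


def train_with_early_stopping(run_epoch: Callable[[int], float],
                              max_epochs: int) -> Tuple[int, float]:
    """Run epochs until the dev WER stops improving; return the best epoch and WER."""
    best_epoch, best_wer = -1, float("inf")
    for epoch in range(max_epochs):
        dev_wer = run_epoch(epoch)  # train one epoch, then decode the dev set
        if dev_wer >= best_wer:
            break                   # no improvement: stop and keep the earlier model
        best_epoch, best_wer = epoch, dev_wer
    return best_epoch, best_wer


# Hypothetical dev WERs per epoch (illustrative numbers only).
made_up_dev_wers = [9.1, 7.8, 7.0, 6.7, 6.8, 6.5]
best_epoch, best_wer = train_with_early_stopping(
    lambda e: made_up_dev_wers[e], max_epochs=len(made_up_dev_wers))
```

With this rule the run stops at the fifth epoch and never reaches
the later, lower value, which shows the trade-off of stopping
without any patience.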

4.1. DNN 

Two baseline DNN systems are presented. The first is the Kaldi
s5 WSJ recipe with a sigmoid DNN model which is pretrained with
a Deep Belief Network [21]; it achieved a WER of 3.81.

We also built a ReLU DNN which requires no pretraining. The ReLU
DNN consisted of 4 layers of 2048 ReLU neurons followed by
softmax, and was trained with geometrically decayed SGD. We also
experimented with deeper and wider networks; however, we found
this 5 layer architecture (the 4 ReLU layers plus softmax) to be
the best. Our ReLU DNN is much easier to train (i.e., no
expensive pretraining) and achieves a WER of 3.79, matching the
WER of the pretrained sigmoid DNN. The ReLU DNN results suggest
that pretraining may not be necessary given sufficient
supervised data, and that a DNN trained without it is
competitive for the acoustic modelling task. Table 1 summarizes
the WERs for our DNN baseline systems.


Table 1: WERs for Wall Street Journal. The ReLU DNN requires no
pretraining and matches the WER of the Kaldi s5 recipe which
uses DBN pretraining.

Model              dev93 WER   eval92 WER
GMM Kaldi tri4b    9.39        5.39
DNN Kaldi s5       6.68        3.81
DNN ReLU           6.84        3.79


Table 2: BLSTM WERs for Wall Street Journal. Larger recurrent
models tend to perform better without overfitting. The deep
BLSTM models do not yield any substantial gains over their
single layer counterparts.

Cell Size   Layers   dev93 WER   eval92 WER
128         1        8.19        5.19
256         1        7.94        4.66
512         1        7.43        4.36
768         1        7.36        4.16
1024        1        7.23        4.06
256         2        7.54        4.36
512         2        7.40        4.25


4.2. Deep BLSTM 

We experimented with single layer and two layer deep BLSTM
models. The cell size reported is per direction (i.e., the total
number of cells is doubled). The BLSTM models take longer to
train and underperform compared to the ReLU DNN model. The
larger BLSTM models tend to outperform the smaller ones,
suggesting overfitting is not an issue. However, there is
limited incremental gain in WER performance with additional
cells. Our best single layer BLSTM, with 1024 bidirectional
cells, achieved only 4.06 WER compared to 3.79 from our ReLU DNN
model.

Deep BLSTM models [7] may give additional model performance,
since the upper layers can access information from the shallower
layers in both directions and additional layers of
non-linearities are available. Our deep BLSTM models contain two
layers; the cell size reported is per direction per layer (i.e.,
the total number of cells is quadrupled). Our deep BLSTM
experiments give mixed results. For the same number of cells per
layer, the deep model performs slightly better. However, if we
fix the number of parameters, the single layer BLSTM model
performs slightly better: the single layer of 1024 bidirectional
cells achieved a WER of 4.06, while the deep two layer BLSTM
model with 512 bidirectional cells per layer achieved a WER of
4.25. Table 2 summarizes our BLSTM experiment WERs.

4.3. TC-DNN-BLSTM-DNN 

We experimented next with a DNN-BLSTM model. Our DNN-BLSTM model
does not have time convolution at its input, and lacks the
second DNN non-linearities for context projection. The two layer
2048 neuron ReLU DNN in front of the BLSTM acts as a signal
processor, projecting the original acoustic signal (i.e., each
fMLLR vector) into a new high dimensional space which can be
more easily digested by the LSTM. The BLSTM module uses 128
bidirectional cells. Compared to the 128 bidirectional cell
BLSTM model, the model improves from 5.19 WER to 3.92 WER, or
24% relative. The results of this experiment suggest that fMLLR
features may not be the best features for BLSTM models (at least
to consume directly); rather, learnt features (through the DNN
feature processor) can yield better features for the BLSTM model
to consume.

Table 3: Ablation effects of our TC-DNN-BLSTM-DNN model. The
DNNs and Time Convolution are used for signal and context
projections. We show that all components are critical to obtain
the best performing model.

Model                dev93 WER   eval92 WER
DNN-BLSTM            7.40        3.92
BLSTM-DNN            6.90        3.84
DNN-BLSTM-DNN        7.19        3.76
TC-DNN-BLSTM-DNN     6.58        3.47

The next experiment we ran was a BLSTM-DNN model. Here, the
BLSTM accepts the original acoustic feature without modification
and emits a context. The context is passed through a two layer
2048 neuron ReLU DNN which provides additional layers of
non-linear projections before classification by the softmax
layer. Once again, the BLSTM module uses only 128 bidirectional
cells. The model improves from 5.19 WER to 3.84 WER, or 26%
relative, when compared to the original 128 bidirectional cell
BLSTM model which does not have the context non-linearities. The
result of this experiment suggests that the LSTM context should
not be used directly for softmax phone classification; rather,
additional layers of non-linearities are needed to achieve the
best performance.

We then experimented with a DNN-BLSTM-DNN model (without time
convolution). Each DNN has two layers of 2048 ReLU neurons, and
the BLSTM layer has 128 cells per direction. This combines the
benefits of a learnt signal processing DNN and the context
projection. Compared to a 128 bidirectional cell BLSTM model,
our WER drops from 5.19 to 3.76, or 28% relative. Compared to a
1024 bidirectional cell BLSTM model, we essentially
redistributed our parameters from a wide shallow network to a
deeper network. We achieve an 11% relative improvement compared
to a single layer 1024 bidirectional cell BLSTM, suggesting the
deeper models are much more expressive and powerful.

Finally, our TC-DNN-BLSTM-DNN model combines the DNN-BLSTM-DNN
with input time convolution. Our model further improves from
3.76 WER without time convolution to 3.47 WER with time
convolution. Compared to the DNN models, we achieve a 0.32
absolute WER reduction, or 8% relative. To the best of our
knowledge, this is the best WSJ eval92 performance without
sequence training [22]. We hypothesize that the time convolution
gives a richer signal representation for the DNN signal
processor and consequently for the BLSTM model to consume. The
time convolution also frees the LSTM's computational capacity to
learn long term dependencies rather than short term
dependencies. Table 3 summarizes the experiments for this
section.
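The relative improvements quoted in this section follow from the
eval92 WERs in Tables 1 to 3. The short check below recomputes
them; the helper name and dictionary layout are illustrative.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer


BLSTM_128 = 5.19  # single layer, 128 cell BLSTM (Table 2)
RELU_DNN = 3.79   # ReLU DNN baseline (Table 1)

# (baseline WER, model WER) pairs taken from the eval92 columns.
comparisons = {
    "DNN-BLSTM vs BLSTM-128": (BLSTM_128, 3.92),
    "BLSTM-DNN vs BLSTM-128": (BLSTM_128, 3.84),
    "DNN-BLSTM-DNN vs BLSTM-128": (BLSTM_128, 3.76),
    "TC-DNN-BLSTM-DNN vs ReLU DNN": (RELU_DNN, 3.47),
}

percent = {name: round(100 * relative_improvement(base, new))
           for name, (base, new) in comparisons.items()}
```

The rounded values are 24, 26, 28 and 8 percent respectively,
matching the figures quoted in the text.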

4.4. Distributed Optimization 

Table 4: Effects of distributed optimization for our
TC-DNN-BLSTM-DNN model. The ASGD experiments use 3 independent
SGD shards.

Model   Epochs   Time (hrs)   dev93 WER   eval92 WER
SGD     17       51.5         6.58        3.47
ASGD    14       16.8         6.57        3.72

Figure 2: SGD vs x3 ASGD WER convergence comparison, each point
represents one epoch of the respective optimizer.

All results presented in the previous sections of this paper
were trained on a single GPU with SGD. To reduce the time
required to train an individual model, we also experimented with
distributed Asynchronous Stochastic Gradient Descent (ASGD)
across multiple GPUs. Our implementation is similar to [?]: we
have 4 GPUs (NVIDIA Tesla K20) in our system, 1 GPU is dedicated
as a parameter server and we have 3 GPU compute shards (i.e.,
the independent SGD learners). We do not apply any stale
gradient decay [23] or warm starting [24]. We use the exact same
learning rate schedule, minibatch size and hyperparameters as
our TC-DNN-BLSTM-DNN SGD baseline. [8] applied distributed ASGD
optimization; however, they applied it on a cluster of CPUs
rather than GPUs. Additionally, [8] did not compare whether
there was a WER differential between SGD and ASGD.
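The effect of asynchrony can be illustrated with a small,
deterministic simulation of a parameter server with three
shards. This is a single-process sketch on a toy one-parameter
regression problem, not the multi-GPU implementation: shards
take turns in round-robin order, each computes its gradient on
the copy of the parameter it pulled after its previous update
(so the gradient is stale by two server updates), and the server
applies gradients without any staleness decay. The data,
learning rate and function names are assumptions.

```python
from typing import List, Tuple


def grad(w: float, x: float, y: float) -> float:
    """Gradient of 0.5 * (w * x - y)^2 with respect to w."""
    return (w * x - y) * x


def simulate_asgd(data: List[Tuple[float, float]], num_shards: int = 3,
                  lr: float = 0.05, rounds: int = 200) -> float:
    server_w = 0.0                     # copy held by the parameter server
    local_w = [server_w] * num_shards  # possibly stale copies held by the shards
    step = 0
    for _ in range(rounds):
        for shard in range(num_shards):
            x, y = data[step % len(data)]
            g = grad(local_w[shard], x, y)  # computed on the shard's stale copy
            server_w -= lr * g              # applied as is, no stale gradient decay
            local_w[shard] = server_w       # shard pulls the fresh parameter
            step += 1
    return server_w


# Toy data generated from y = 2 * x, so the optimum is w = 2.
toy_data = [(1.0, 2.0), (0.5, 1.0), (-1.0, -2.0), (1.5, 3.0)]
w_asgd = simulate_asgd(toy_data)
```

On this easy problem the stale gradients do not prevent
convergence to the optimum; the sketch only shows the update
pattern and says nothing about the WER gap reported in Table 4.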

Our baseline TC-DNN-BLSTM-DNN SGD system took 17 epochs, or
about 51 wall clock hours, to converge to a dev and test WER of
6.58 and 3.47. Our distributed implementation converges in 14
epochs and 16.8 wall clock hours, and achieves a dev and test
WER of 6.57 and 3.72. The distributed optimization is able to
match the dev WER; however, the test WER is significantly worse.
It is unclear whether this WER differential is due to the
asynchronous nature of the optimizer or due to the small
dataset; we suspect that with larger datasets the gap between
ASGD and SGD will shrink. The conclusion we draw is that ASGD
can converge much faster, although there may be an impact on
final WER performance. Table 4 and Figure 2 summarize our
results.
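The size of the speedup follows directly from the numbers in
Table 4. The short calculation below is a worked check; the
variable names are illustrative.

```python
# Values copied from Table 4.
sgd_epochs, sgd_hours = 17, 51.5
asgd_epochs, asgd_hours = 14, 16.8

# Overall wall clock speedup to convergence.
wall_clock_speedup = sgd_hours / asgd_hours

# Speedup per epoch, which removes the effect of ASGD needing fewer epochs.
per_epoch_speedup = (sgd_hours / sgd_epochs) / (asgd_hours / asgd_epochs)

# Absolute eval92 WER gap between ASGD and SGD.
eval92_gap = 3.72 - 3.47
```

With three compute shards the run finishes roughly 3.07 times
faster overall and about 2.52 times faster per epoch, at a cost
of about 0.25 absolute WER on eval92.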

5. Conclusions 

In this paper, we presented a novel TC-DNN-BLSTM-DNN acoustic
model architecture. On the WSJ eval92 task, we report a 3.47
WER, a more than 8% relative improvement over the DNN baseline
of 3.79 WER. Our model is easy to optimize and implement, and
does not suffer from exploding gradients even with high learning
rates. We also found that pretraining may not be necessary for
DNNs: the DBN pretrained DNN achieved a 3.81 WER compared to
3.79 WER for our ReLU DNN without pretraining. We also
experimented with ASGD on our TC-DNN-BLSTM-DNN model; we were
able to match the SGD dev WER, but the WER on the evaluation set
was significantly worse at 3.72. In future work, we seek to
apply sequence training on top of our acoustic model to further
improve the model accuracy.